Handling Incomplete Categorical Data for Supervised Learning

Authors

Been-Chian Chien

Cheng-Feng Lu

Steen J. Hsu

Abstract

Classification is an important research topic in knowledge discovery. Most research on classification assumes that a complete dataset is given for training and that the test data contain values for every attribute. In real-world applications, however, data are often incomplete. In this paper, we propose new schemes for learning classification models from incomplete categorical data. Three methods based on rough set theory are developed and discussed for handling incomplete training data. To evaluate the proposed schemes, we ran experiments with several well-known classification models and compared the results with those of previous methods.
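The abstract does not describe the three methods, so the following is not the authors' approach. It is only a minimal sketch of one standard rough-set idea for incomplete categorical data, the tolerance relation, under the assumption that a missing value is treated as compatible with any value. The marker `MISSING`, the function names, and the toy rows are all hypothetical and chosen for illustration.

```python
MISSING = None  # assumed marker for a missing attribute value


def tolerant(x, y):
    """True if x and y agree on every attribute, treating a missing
    value in either object as compatible with anything."""
    return all(a == b or a is MISSING or b is MISSING for a, b in zip(x, y))


def tolerance_class(i, rows):
    """Indices of all objects that are tolerant with object i."""
    return {j for j, other in enumerate(rows) if tolerant(rows[i], other)}


def approximations(rows, labels, target):
    """Lower and upper approximations of the class `target`."""
    target_set = {i for i, lab in enumerate(labels) if lab == target}
    lower, upper = set(), set()
    for i in range(len(rows)):
        cls = tolerance_class(i, rows)
        if cls <= target_set:   # every compatible object is in the class
            lower.add(i)
        if cls & target_set:    # at least one compatible object is in the class
            upper.add(i)
    return lower, upper


# Toy incomplete dataset: two categorical attributes, one missing value.
rows = [
    ("sunny", "high"),
    ("sunny", MISSING),
    ("rainy", "low"),
    ("rainy", "high"),
]
labels = ["no", "yes", "yes", "no"]

lower, upper = approximations(rows, labels, "yes")
print(lower, upper)
```

In this toy example, objects 0 and 1 cannot be told apart because of the missing value, so neither can be assigned to "yes" with certainty; both land only in the upper approximation.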


Similar resources

Analysis of Dynamic Longitudinal Categorical Data in Incomplete Contingency Tables Using Capture-Recapture Sampling: A case Study of Semi-Concentrated Doctoral Exam

Abstract. In this paper, dynamic longitudinal categorical data and estimation of their parameters in incomplete contingency tables are evaluated. To apply the proposed method, a study has been conducted on the data of the semi-concentrated doctoral exam of the National Organization for Educational Testing (NOET). The results of studies such as the obtained confidence intervals and calculating t...


Missing Data Imputation for Supervised Learning

This paper compares methods for imputing missing categorical data for supervised learning tasks. The ability of researchers to accurately fit a model and yield unbiased estimates may be compromised by missing data, which are prevalent in survey-based social science research. We experiment on two machine learning benchmark datasets with missing categorical data, comparing classifiers trained on ...


A Simple Yet Fast Clustering Approach for Categorical Data

Categorical data has always posed a challenge in data analysis through clustering. With the increasing awareness of Big data analysis, the need for better clustering methods for categorical and mixed data has arisen. The prevailing clustering algorithms are not suitable for clustering categorical data, mainly because the distance functions used for continuous data are not applicable for...


A Semi-supervised Learning Framework to Cluster Mixed Data Types

We propose a semi-supervised framework to handle diverse data formats or data with mixed-type attributes. Our preliminary results in clustering data with mixed numerical and categorical attributes show that the proposed semi-supervised framework gives better clustering results in the categorical domain. Thus the seeds obtained from clustering the numerical domain give additional knowledge to ...


A semi-supervised regression model for mixed numerical and categorical variables

In this paper, we develop a semi-supervised regression algorithm to analyze data sets which contain both categorical and numerical attributes. This algorithm partitions the data sets into several clusters and at the same time fits a multivariate regression model to each cluster. This framework allows one to incorporate both multivariate regression models for numerical variables (supervised lear...




Publication date: 2006
